Abstract

In this paper we consider the problem of collectively classi- 
fying entities where relational information is available across 
the entities. In practice inaccurate class distribution for each 
entity is often available from another (external) classifier. 
For example this distribution could come from a classifier 
built using content features or a simple dictionary. Given 
the relational and inaccurate external classifier information, 
we consider two graph based settings in which the problem 
of collective classification can be solved. In the first setting 
the class distribution is used to fix labels to a subset of nodes 
and the labels for the remaining nodes are obtained like in a 
transductive setting. In the other setting the class distribu- 
tions of all nodes are used to define the fitting function part 
of a graph regularized objective function. We define a gen- 
eralized objective function that handles both the settings. 
Methods like harmonic Gaussian field and local-global con- 
sistency (LGC) reported in the literature can be seen as spe- 
cial cases. We extend the LGC and weighted vote relational 
neighbor classification (WvRN) methods to support usage of 
external classifier information. We also propose an efficient 
least squares regularization (LSR) based method and relate 
it to information regularization methods. All the methods 
are evaluated on several benchmark and real world datasets. 
Considering together speed, robustness and accuracy, exper- 
imental results indicate that the LSR and WvRN-extension 
methods perform better than other methods. 

1 Introduction 

Traditionally classifiers are built using only local fea- 
tures of individual entities such as web pages or images. 
Relational classifiers also make use of relational infor- 
mation that exist across the entities. For example, local 
features of a web page could be collection of keywords 
that appear in the title or page content and useful rela- 
tional information could be scores computed from pres- 
ence/absence of inlinks and/or outlinks, similarity of 
page structure, URL etc. In relational classification problems,
a collection of entities is represented as a graph
or a network of nodes. Each node represents an entity 
with its set of local features, class attribute informa- 
tion and edge weight encodes any relational information 
that exists between those connected nodes. In its gen- 
eral setup the graph contains zero or more labeled nodes 
and one or more unlabeled nodes and the goal is to label
a set of unlabeled nodes (test nodes). This problem
is solved either by treating it as an induction problem,
where a (possibly statistical) model is learnt using
training data (labeled nodes) and used to classify
future data, or as a transduction problem, where classification
is needed only on the test nodes. The key
aspect of this relational classification problem is making 
collective inference of the class labels, that is, labels of 
all the nodes are obtained simultaneously. See [5], @] 
and references there for more details. 

In this paper we consider a related relational learn- 
ing problem where, instead of a subset of labeled nodes, 
we have inaccurate external label/class distribution in- 
formation for each node. This problem arises in many 
web applications. Consider, for example, the problem 
of identifying pages about Public works, Court, Health, 
Community development, Library etc. within the web 
site of a particular city. The link and directory rela- 
tions contain useful signals for solving such a classifica- 
tion problem. Note that this relational structure will 
be different for different city web sites. If we are only 
interested in a small number of cities then we can afford 
to label a number of pages in each site and then apply 
transductive learning using the labeled nodes. But, if 
we want to do the classification on hundreds of thou- 
sands of city sites, labeling on all sites is expensive and 
we need to take a different approach. One possibility 
is to use a selected set of content dictionary features 
together with the labeling of a small random sample 
of pages from a number of sites to learn an inaccurate 
probabilistic classifier, e.g., logistic regression. Now, for 
any one city web site, the output of this initial classifier 
can be used to generate class distributions for the pages 
in the site, which can then be used together with the
relational information in the site to get accurate classification.

The problem of doing relational learning together 
with externally available class distribution information 
can be solved by modifying existing transductive meth- 
ods. The problem has been discussed within broad 
sets of techniques such as denoising, relaxation label- 
ing, metric labeling (see [3] and references there) etc., 
as well as within recent specific techniques such as the 
Gaussian Field Harmonic Function (GFHF) method [9] 
and the Information Regularization (IR) [3] and dual IR
(DIR) [6] methods. We also consider two more methods,
viz., the Local-Global Consistency (LGC) method [8]
and the probabilistic Weighted vote Relational Neighbor
(WvRN) Classification method with Relaxation Labeling [3].
The main aim of this paper is to take a few techniques
(the last five specific methods mentioned above)
that are popularly discussed and used in recent rela- 
tional learning literature, extend them as needed and 
compare in different settings to see which ones are most 
effective in solving our problem. The proposed exten- 
sions and method include supporting external classifier 
information for the LGC and WvRN methods, and us- 
ing least squares (LS) divergence measure (referred to 
as the LSR method) as opposed to the KL-divergence 
measure used in the IR methods. We also establish the 
relation between the LSR method and the IR methods. 

Our problem may be solved in two different settings. 
In the first setting we select a subset of nodes for which 
we have high confidence in the class labels and fix the 
labels of these nodes. For example, we may select the
nodes whose probability or decision function score
exceeds a certain threshold. We can then treat
the remaining nodes as unlabeled and solve it as a 
transduction problem in which relational information 
is used to propagate labels from the labeled set to the 
unlabeled set through the connections according to their 
strengths. In the second setting we make use of the 
external class distribution of all the nodes fully and no 
nodes are treated as unlabeled. For different methods 
this is done in different ways. 

Detailed experiments on four benchmark datasets 
indicate that the second solution setting (full use of class 
distribution information) is better. Further experiments 
(in setting 2) on real world shopping domain datasets 
clearly demonstrate that significant performance gain 
can be achieved with the proposed methods. Consid- 
ering speed, robustness and accuracy together, the ex- 
perimental results indicate that the LSR and WvRN 
extension methods are better than the other methods. 

The paper is organized as follows. Section 2 formulates
the problem and describes the two solution settings.
The methods are grouped as function estimation
and probability distribution based methods; details on
how they are modified for the two settings are given 
in sections 3 and 4. In section 5 we give experimental 
results, and summarize key observations in section 6. 

2 Solution Settings. 

In this section we present the statement of the problem 
and discuss two solution settings in which the problem 
can be solved. Then we briefly mention the methods 
and their extensions considered in this paper. 

2.1 Notations and Problem Formulation. Assume that we have a graph $G$
with vertices $V$ and edges $E$. Let the vertices (nodes) $v_i$, $i \in N$,
where $N = \{1, \ldots, n\}$, represent the entities. Let the edges $(i, j)$
(where $i, j \in N$) encode relational information between nodes $v_i$ and
$v_j$, with weight $w_{ij}$. Let $W$ denote the weight matrix with $w_{ij}$
as its elements. Let $\Delta$ be the graph Laplacian matrix defined as
$\Delta = D - W$ [5] and let its unnormalized and normalized versions be
denoted as $L_{un} = \Delta$ and
$L_{nrm} = D^{-1/2} \Delta D^{-1/2} = I - D^{-1/2} W D^{-1/2}$ respectively.
$D$ is a diagonal matrix whose $i$th diagonal element is
$d_{ii} = \sum_j w_{ij}$; $d_{ii}$ measures the degree of the $i$th node.
In a traditional problem formulation, labeling of $v_i$ corresponds to
specifying its class label $c_i$ where $c_i \in \{1, \ldots, K\}$ and $K$ is
the number of classes. A related quantity is the vector $y_i$ with elements
$y_{i,k} = \delta_{k,c_i}$, where $\delta$ is the Kronecker delta function.
Let $Y = [y_1, \ldots, y_n]$. Alternatively, one can specify the class
distribution information $p_i = [p_{i,1}, \ldots, p_{i,K}]^T$ with
$\sum_k p_{i,k} = 1$. Let $P = [p_1, \ldots, p_n]$. The case where
only class label information ($c_i$, $y_i$ and $Y$) is given
may be viewed as a special case of the class distribution
view with $P = Y$. Therefore, without loss of generality,
we assume that we are given $P$ and we can derive the
class label for node $v_i$ as $c_i = \arg\max_k p_{i,k}$. Thus, given
$P$, we can obtain derived labels $c_i$ and obtain the corresponding
performance (e.g., accuracy) $Perf(P)$. In
this paper we are mainly concerned with effective use
of an initial $P$ (that is available from some external
means) and using it together with the relational information
to do better. In such a case we call $Perf(P)$
the initial performance of the classifier. The classification
problem of using relational information ($W$)
and initial class distribution ($P$) can be loosely stated
as follows: given $(G, W, P)$, find $\hat{P}$ such that $Perf(\hat{P})$
is better than $Perf(P)$. Given the above formulation
we consider two settings in which the problem can be
solved.
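The quantities defined above can be illustrated with a minimal NumPy sketch. It is not taken from the paper: the function names (`graph_laplacians`, `derived_labels`, `perf`), the toy matrices, the 0-based class indices, and the choice to store one class distribution per row of `P` are all assumptions made for illustration.

```python
import numpy as np

def graph_laplacians(W):
    """Return (D, L_un, L_nrm) for a symmetric, non-negative weight matrix W."""
    d = W.sum(axis=1)                       # d_ii = sum_j w_ij
    D = np.diag(d)
    L_un = D - W                            # unnormalized Laplacian D - W
    d_inv_sqrt = 1.0 / np.sqrt(d)           # assumes every node has positive degree
    L_nrm = np.eye(len(d)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    return D, L_un, L_nrm

def derived_labels(P):
    """c_i = argmax_k p_{i,k}, with one class distribution per row of P."""
    return P.argmax(axis=1)

def perf(P, true_labels):
    """Accuracy of the labels derived from P (one possible choice of Perf)."""
    return float(np.mean(derived_labels(P) == true_labels))

# Hypothetical 3-node graph and external class distributions.
W = np.array([[0, 1, 0],
              [1, 0, 2],
              [0, 2, 0]], dtype=float)
P = np.array([[0.9, 0.1],
              [0.4, 0.6],
              [0.3, 0.7]])

D, L_un, L_nrm = graph_laplacians(W)
labels = derived_labels(P)
initial_perf = perf(P, np.array([0, 1, 0]))
```

Here `initial_perf` plays the role of the initial performance of the external classifier; the methods discussed later aim to return distributions whose derived labels score higher.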

2.2 Solution Setting 1. In the first setting the ex- 
ternal class distribution information available with all 
the nodes ($P$) is first used to select a subset of labeled
nodes, $S$. We treat the remaining nodes as unlabeled.
Then, for each $i \in S$ we fix the label of node
$i$ as $\arg\max_k p_{i,k}$. Once this is done, the problem can
be solved using any standard graph based transductive
or semi-supervised learning method. However, while
conventional semi-supervised learning or transduction
problem settings usually assume that $p_i$ or $y_i$ is accurate,
this is not the case here. The selection of the subset
of nodes is an important aspect and will be addressed 
later when we discuss specific methods. 

2.3 Solution Setting 2. In the second setting the 
external class distribution information of all the nodes 
is used in the solution process. Exactly how this is done 
will be detailed in later sections. 

2.4 Existing Methods and Extensions. Graph 
based transduction methods fall in one of two groups: 
(1) those which are based on function estimation and (2) 
those which are based on probability distribution esti- 
mation. In this paper we consider the following methods 
from each group: the Gaussian Field Harmonic Func- 
tion (GFHF) method [9] and the Local-Global Consis- 
tency (LGC) method [5] from the first group and the 
Information Regularization (IR) [2], dual IR (DIR) [6]
methods and the probabilistic Weighted vote Relational 
Neighbor (WvRN) Classification method with Relax- 
ation Labeling [4] from the second group. While all 
these five methods can easily solve the problem in the 
first setting, the second setting is (briefly) discussed in 
the literature only for GFHF, IR and DIR methods. In 
the following sections we extend LGC and WvRN meth- 
ods to handle the second setting. We also propose an ef- 
ficient least squares regularization (LSR) based method 
as compared to the KL-divergence based IR methods 
and relate these methods. 

3 Function Estimation Methods 

Graph based classification methods in this group find 
the classification function by minimizing the function: 



(3.1)  $Q(F) = Q_{GR}(F) + C \, Q_{datafit}(F, Y)$



where C is a positive regularization constant. The ob- 
jective function consists of two terms. The first term 
known as the graph regularization (GR) term makes 
use of the weight matrix W and imposes smoothness 
of the function over the graph. The second term is de- 
pendent on the known label information and measures 
deviations from model implied label information. Con- 
sequently, this term is often called the data fitting term. 
In the traditional transductive setting, labels for a subset
of nodes $S$ are given and the remaining labels need
to be inferred. For example, in the GFHF method, the
function $F$ is set to $Y$ for the labeled nodes $S$ and is
estimated for the unlabeled nodes by minimizing only the
graph regularization term $Q_{GR}(F) = \sum_{k=1}^{K} F_k^T L_{un} F_k$.
In the LGC method, the function $F$ is estimated by minimizing
$\sum_{k=1}^{K} \left( C F_k^T L_{nrm} F_k + \|F_k - Y_k\|^2 \right)$, where $Y$ is
set to zero for the unlabeled nodes.

As explained in section 2, applying the methods in 
solution setting 1 is straightforward. The selection of 
S will be discussed below. Some care is needed when 
extending the methods to deal with solution setting 2, 
which aims to make effective use of the class distribution 
information $P$. It makes good sense to set $Y = P$ in
(3.1) and thus try to force $F$ to be close to $P$ while also
ensuring its smoothness on the graph. It is also a good
idea to go a step further and use the uncertainty present
in the class distribution information to apply different
weights to different data fitting terms. With this in
mind we give a generic quadratic formulation with two
parameters ($H$ and $\Lambda$). The original versions of the GFHF
method and the LGC method follow as special cases.
Also, choosing the parameters differently using P leads 
to extensions of these methods to solution setting 2. 

3.1 Generic Quadratic Objective Function. We
define the following generic quadratic objective function
to estimate $F$; we shall see that the objective
functions used in the GFHF and LGC methods are special
cases of this objective function.

(3.2)  $Q_g(F) = \sum_{k=1}^{K} C (F_k - V \Lambda Y_k)^T H (F_k - V \Lambda Y_k) + F_k^T L F_k$

where $F = [F_1 \cdots F_K]$ and $Y_k$ is the $k$th column of $Y$. On
comparing (3.2) and (3.1) we see that the data fitting
term is nothing but a weighted quadratic error function
and the graph regularization term is another quadratic
function defined in terms of the graph Laplacian matrix $L$.

Let us consider the weighted quadratic error function.
We have introduced two parameters: a generic error
weight matrix $H$ and a label degree matrix $\Lambda$. The
role of $H$ is to give different weightage to individual errors
(data fitting terms); it is assumed to be positive
definite. When $H$ is diagonal the data fitting term is
essentially a weighted sum of squared errors, with the error
in the $i$th node weighted by $h_{ii}$. $\Lambda$ is a diagonal
matrix and its role is to incorporate any label degree
information we want to associate with each node, where
$0 \le \lambda_{ii} \le 1$. For example, if node $i$ is unlabeled then
$\lambda_{ii} = 0$, and when it is fully labeled $\lambda_{ii} = 1$. Note that
the interval $0 \le \lambda_{ii} \le 1$ brings in the notion of partial
labeling and provides flexibility when $y_i$ ($i$th row of $Y$)
is inaccurate.

Finally, $V$ is a node regularization matrix which
is also diagonal. Wang et al. [7] introduced the notion
of node regularization in the original LGC formulation
to handle the label imbalance problem in the graph. Assuming
$y_i$ to contain only one non-zero element, they
defined the normalized label matrix $Z = VY$ with
$v_{ii} = \sum_{k=1}^{K} \frac{d_{ii} y_{i,k}}{v_k}$. Here $v_k = \sum_{i=1}^{n} d_{ii} y_{i,k}$ and we can see
that $\sum_{i=1}^{n} z_{i,k} = 1$. Thus $V$ balances the influence of
labels from different classes and allows nodes with high degree
to make more contribution. In our general setting
we allow $0 \le y_{i,k} \le 1$, $k = 1, \ldots, K$, subject to
$\sum_k y_{i,k} = 1$, and we also have the label degree matrix
$\Lambda$. Therefore we redefine $v_{ii}$ as $v_{ii} = \sum_{k=1}^{K} \frac{d_{ii} y_{i,k}}{\eta_k}$
where $\eta_k = \sum_{i=1}^{n} d_{ii} \lambda_{ii} y_{i,k}$. Defining $Z = V \Lambda Y$ we
have $\sum_{i=1}^{n} z_{i,k} = 1$, and $Z$ becomes a normalized
matrix as earlier.
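The node regularization step can be sketched in NumPy as follows. This is an illustrative reading rather than the authors' code: it assumes the weights take the form v_ii = sum_k d_ii y_ik / eta_k with eta_k = sum_i d_ii lambda_ii y_ik, that each row of `Y` has a single non-zero entry, and that every class has at least one node with positive label degree. The function name and the toy values are hypothetical.

```python
import numpy as np

def normalized_label_matrix(Y, d, lam):
    """Z = V Lambda Y with v_ii = sum_k d_ii * y_ik / eta_k (assumed form)."""
    eta = (d * lam) @ Y                     # eta_k = sum_i d_ii * lambda_ii * y_ik
    v = d * (Y / eta).sum(axis=1)           # diagonal of V
    return (v * lam)[:, None] * Y           # z_ik = v_ii * lambda_ii * y_ik

# Hypothetical example: 4 nodes, 2 classes, one-hot rows.
Y = np.array([[1, 0],
              [1, 0],
              [0, 1],
              [0, 1]], dtype=float)
d = np.array([1.0, 3.0, 2.0, 2.0])          # node degrees d_ii
lam = np.array([1.0, 0.5, 0.8, 0.2])        # label degrees lambda_ii

Z = normalized_label_matrix(Y, d, lam)
col_sums = Z.sum(axis=0)                    # each column of Z sums to one
```

With one-hot rows every column of `Z` sums to one, so a class with many high-degree labeled nodes does not dominate a class with few.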

Given the additive nature of the function $Q_g(F)$,
each $F_k$ can be solved for independently. Setting the
derivative $\partial Q_g(F) / \partial F_k$ to zero gives the following closed form
solution, which involves matrix inversion and costs $O(n^3)$.

(3.3)  $F_k = C (L + C H)^{-1} H Z_k$

Here $Z_k$ is the $k$th column of $Z$. In the following discussion
we assume that $H$ is diagonal and invertible.
Then equation (3.3) can be rewritten as
$F_k = (I + \frac{1}{C} H^{-1} L)^{-1} Z_k$. Then, instead of estimating $F_k$ from
(3.3) using matrix inversion, we can find it by iteratively
using the fixed point equation $F_k = -\frac{1}{C} H^{-1} L F_k + Z_k$.
Note that the convergence rate of the iterative solution
depends on the eigenvalues of the matrix $(I + \frac{1}{C} H^{-1} L)$.
Also, the fixed point equation changes depending on
whether the normalized or unnormalized graph Laplacian
is used and on the specific form of the error weight matrix $H$.
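The relation between the closed form solution and the fixed point iteration can be checked numerically. The sketch below is a toy illustration with assumed names and values: it uses the normalized Laplacian of a 4-node chain, H = I, and C = 4. That value of C is chosen only so that (1/C) H^{-1} L has spectral radius below one, since the eigenvalues of the normalized Laplacian lie in [0, 2]; the plain iteration need not converge for other choices.

```python
import numpy as np

def solve_closed_form(L, H, Z, C):
    """F = C (L + C H)^{-1} H Z, computed with a linear solve instead of an inverse."""
    return np.linalg.solve(L + C * H, C * (H @ Z))

def solve_fixed_point(L, H, Z, C, iters=500):
    """Iterate F <- -(1/C) H^{-1} L F + Z for a diagonal, invertible H."""
    M = (L / C) / np.diag(H)[:, None]       # (1/C) H^{-1} L when H is diagonal
    F = Z.copy()
    for _ in range(iters):
        F = Z - M @ F
    return F

# Hypothetical 4-node chain graph.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
d = W.sum(axis=1)
L_nrm = np.eye(4) - W / np.sqrt(np.outer(d, d))
H = np.eye(4)
Z = np.array([[1.0, 0.0],
              [0.0, 0.0],
              [0.0, 0.0],
              [0.0, 1.0]])
C = 4.0

F_direct = solve_closed_form(L_nrm, H, Z, C)
F_iter = solve_fixed_point(L_nrm, H, Z, C)
```

Both routes give the same `F` here; the iterative route only needs matrix-vector products, which is what makes it attractive for sparse graphs.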

3.2 GFHF Method As a Special Case. The
GFHF method [5] was originally proposed to solve the
semi-supervised learning problem based on a Gaussian random
field model for a weighted graph. In this
formulation a real valued function $f$ is estimated by minimizing
a quadratic energy function ($\sum_{k=1}^{K} F_k^T L_{un} F_k$)
subject to the constraint that the function $f$ takes the
actual label values on the set of labeled nodes. It is assumed
that $y_{i,k}$ takes value in $\{0, 1\}$. This solution is
retrieved by considering our generic quadratic formulation
and setting $L = L_{un}$, $V = I$, $H = D \Lambda (I - \Lambda)^{-1}$
and $C = 1$. Further, $\lambda_{ii} = 1$ if node $i$ is labeled and
is zero otherwise. Note that the specific structure of $H$
and $\Lambda$ imposes the constraints on the labeled nodes by
assigning infinite weights to the errors on labeled nodes.
On substituting these specific matrices we get:

(3.4)  $F_k = (I - (I - \Lambda) D^{-1} W)^{-1} \Lambda Y_k$

where $Y_k$ is the $k$th column of the matrix $Y$. The corresponding
fixed point equation (after including the node
regularization matrix $V$) is: $F_k = (I - \Lambda) D^{-1} W F_k + \Lambda Z_k$.

Note that (3.4) can handle both the solution settings
presented in section 2. In a transductive setting
with labeled and unlabeled nodes we have $\lambda_{ii} = 1$ for
the labeled subset of nodes $i \in S$ and $\lambda_{ii} = 0$
for the remaining set of nodes $i \notin S$. The authors of the GFHF
method also gave a random walk interpretation to this method. That is, $f_{i,j}$
has an interpretation as the probability that a particle
starting from node $i$ hits a labeled node with label 1
(with class $j$ being considered as 1 like in a one-vs-all setting).
They also discussed incorporating external class
label information using the notion of dongle nodes. Here
each node having label information is attached with a
dongle node and the label value is tied to this node.
Further, a transition of probability $\mu$ was defined from
each node to its dongle node and all other transitions
are covered with probability $(1 - \mu)$.

It may be noted that (3.4) is a general form with
this interpretation and when $\Lambda = \mu I$ we get the original
form of the GFHF method. Thus we allow these transition
probabilities to take different values for different nodes
and this is very useful since we may have different levels
of confidence on the label information we get for each
node from the external classifier. Below we shall discuss
how to choose these values.

3.3 LGC Method As a Special Case and its
Extension. Zhou et al. [8] proposed a semi-supervised
learning approach to design a smooth classification function
$F$ such that it respects any intrinsic structure revealed
by the labeled and unlabeled nodes. Starting
with the intuition of label propagation they introduced
an iterative algorithm which essentially finds the solution
to a fixed point (FP) equation. With $y_i$ having only one
non-zero element and by setting, in our generic quadratic
formulation, $L = L_{nrm}$, $V = I$, $H = I$, $\lambda_{ii} = 1$, $\forall i \in S$
and $\lambda_{ii} = 0$, $\forall i \notin S$, we can retrieve the FP equation
following the same steps as in the previous section. Zhou
et al. [8] also considered several variants of fixed point
equations and showed how one such FP equation can be
obtained as the solution from optimizing a regularized objective
function ($\sum_{k=1}^{K} \left( C F_k^T L_{nrm} F_k + \|F_k - Y_k\|^2 \right)$).
Equation (3.2) can be seen as a generalized version of
such an objective function. Unlike the GFHF method, the
regularization constant $C$ has to be set using standard
techniques like cross-validation, and this is possible only
when a sufficient number of labeled nodes is available.

Extension for Setting 2 We extend the basic
LGC method to handle setting 2 as follows. Firstly,
we relax $y_{i,k}$ as described earlier. Secondly, we make use
of the label degree matrix $\Lambda$ to handle inaccurate label
information. Finally, we allow $S$ to be the entire set of
nodes. With the specific structures of $L$ and $H$ defined as
above we get:

(3.5)  $F_k = (1 - \gamma)(I - \gamma D^{-1/2} W D^{-1/2})^{-1} \Lambda Z_k$

where $\gamma$ is determined by $C$, and it can be rewritten in fixed point form
as done earlier with the GFHF method. We will discuss
shortly how to choose $C$ and $\Lambda$ in the two settings.



The solutions of (3.4) and (3.5) can be obtained iteratively
from the corresponding fixed point equations. In all our experiments
we found the solution iteratively. The iterative update is
very useful when $W$ is sparse (which is often the case in
practice) and the convergence rate is dependent on the
eigenvalues of the matrix involved in the inversion. In
the worst case the cost is expected to be in the order of
$O(mn^2)$ where $m$ is a bound on the number of neighbors
of a node in the graph. When the graph is dense, the
complexity can be cubic in n [T]. 
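The two fixed point iterations for the GFHF form and the LGC-extension form can be sketched as below. This is a toy NumPy illustration under stated assumptions: the 4-node chain, the values of `lam` and `Z`, and the choice `gamma = 0.5` are hypothetical, `gamma` is treated as a free parameter in (0, 1), and every label degree is taken to be strictly positive so that the GFHF iteration is a contraction.

```python
import numpy as np

def gfhf_fixed_point(W, lam, Z, iters=1000):
    """Iterate F <- (I - Lambda) D^{-1} W F + Lambda Z."""
    T = W / W.sum(axis=1, keepdims=True)    # D^{-1} W, a row-stochastic matrix
    F = lam[:, None] * Z
    for _ in range(iters):
        F = (1.0 - lam)[:, None] * (T @ F) + lam[:, None] * Z
    return F

def lgc_fixed_point(W, lam, Z, gamma, iters=1000):
    """Iterate F <- gamma S F + (1 - gamma) Lambda Z with S = D^{-1/2} W D^{-1/2}."""
    d = W.sum(axis=1)
    S = W / np.sqrt(np.outer(d, d))
    F = lam[:, None] * Z
    for _ in range(iters):
        F = gamma * (S @ F) + (1.0 - gamma) * lam[:, None] * Z
    return F

# Hypothetical 4-node chain with external class distributions as rows of Z.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
Z = np.array([[0.9, 0.1],
              [0.6, 0.4],
              [0.5, 0.5],
              [0.2, 0.8]])
lam = np.array([0.8, 0.2, 0.1, 0.6])        # per-node label degrees (assumed)

F_gfhf = gfhf_fixed_point(W, lam, Z)
F_lgc = lgc_fixed_point(W, lam, Z, gamma=0.5)
predicted = F_gfhf.argmax(axis=1)           # derived labels per node
```

Each update only touches the neighbors of a node, so with a sparse `W` stored in a sparse format the per-iteration cost grows with the number of edges rather than with n squared.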

3.4 Parameter Selection in Two Settings. In
this section we present ways of selecting the subset $S$
and the other parameters $\Lambda$ and $C$ that appear in (3.2).
We propose two scoring schemes that can be used to
make the subset selection for solution setting 1.

Maximum Probability Scoring Scheme Let
$p_i^{max} = \max_k p_{i,k}$, $\forall i$. Then we sort the nodes based
on the maximum probability score ($p_i^{max}$) of each node
from its class distribution. Then the subset of nodes
that satisfy the condition $p_i^{max} > p_{th}$ can be selected as
the set of labeled nodes. Essentially we select nodes
on which we have high confidence in their labeling
assignments. If $p_{th}$ is very high then the number of
labeled nodes becomes smaller. However, the noise level
(percentage of noisy labeled nodes) will also be low, and
care has to be taken to ensure that each class has at least
some labeled nodes. On the other hand, if the threshold
level is low then, though the number of labeled nodes
increases, the noise level also increases. Alternatively, we
can select the top-M percentage of nodes as the subset
of labeled nodes. We will refer to the subset selection
scheme based on the maximum probability score as the
MPS scheme.
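The MPS selection can be sketched as follows. The function name `mps_select`, its two keyword arguments, and the toy matrix are assumptions for illustration; class distributions are stored one per row and class indices are 0-based.

```python
import numpy as np

def mps_select(P, p_th=None, top_percent=None):
    """Select labeled nodes by maximum probability score.

    Give either a threshold p_th or a percentage top_percent.
    Returns the selected node indices and their fixed labels.
    """
    p_max = P.max(axis=1)                   # maximum probability score per node
    if p_th is not None:
        selected = np.flatnonzero(p_max > p_th)
    else:
        m = int(np.ceil(len(p_max) * top_percent / 100.0))
        selected = np.argsort(-p_max, kind="stable")[:m]
    return selected, P[selected].argmax(axis=1)

# Hypothetical external class distributions for 4 nodes.
P = np.array([[0.95, 0.05],
              [0.55, 0.45],
              [0.20, 0.80],
              [0.50, 0.50]])

S_thr, labels_thr = mps_select(P, p_th=0.75)
S_top, labels_top = mps_select(P, top_percent=50)
```

In practice one would also check that every class is represented among the selected labels, as the text cautions for high thresholds.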

Entropy based Scoring Scheme In this scheme
we sort the nodes based on an entropy based score $\eta_i$ for
node $i$ defined as $\eta_i = 1 - E(p_i)/E_{max}$, where $E(p_i)$
represents the entropy of the class distribution $p_i$ and
$E_{max} = \log(K)$ is the maximum entropy possible for a
given number of classes $K$. Thus $\eta_i$ lies in the interval
[0, 1] and takes high or low values depending on the
spread of the class distribution. This score is motivated
by the view that we do not have any class label
information when the class distribution is uniform and
we are certain when $p_{i,k} = \delta_{k,c_i}$. Setting a threshold
to select the subset of labeled nodes $S$ is harder here
than in the MPS scheme, so, as mentioned earlier,
we can simply select the top-M percentage of nodes
sorted by this score. We will refer to this subset selection
scheme as the EBS scheme.
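The entropy based score can be computed as in the sketch below. The function name, the `eps` clipping used to treat 0 log 0 as 0, and the example distributions are assumptions; class distributions are stored one per row.

```python
import numpy as np

def entropy_score(P, eps=1e-12):
    """eta_i = 1 - E(p_i) / log(K), with 0 * log(0) treated as 0."""
    K = P.shape[1]
    entropy = -np.sum(P * np.log(np.clip(P, eps, 1.0)), axis=1)
    return 1.0 - entropy / np.log(K)

# Hypothetical distributions: certain, uniform, and in between.
P = np.array([[1.0, 0.0, 0.0],
              [1.0 / 3, 1.0 / 3, 1.0 / 3],
              [0.7, 0.2, 0.1]])

eta = entropy_score(P)
ranking = np.argsort(-eta, kind="stable")   # most confident nodes first
```

A one-hot distribution scores 1 and a uniform one scores 0, matching the motivation given above; the top-M percentage of `ranking` then gives the labeled subset.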

Note that the above two schemes only use the class 
distribution information to select the subset of nodes. 
It is also possible in addition to make use of the graph 
structure information and is a direction for further work. 

Choice of $\Lambda$ Recall that $\lambda_{ii}$ has the interpretation
of transition probability in the GFHF method, and it has
the interpretation of partial label information in the
LGC-extension method. Earlier we gave two useful
scoring schemes based only on the external classifier
information. We can make use of either of these schemes
to fix $\lambda_{ii}$. That is, we can either set $\lambda_{ii} = p_i^{max}$ or
$\lambda_{ii} = \eta_i$. We evaluate both these schemes in our experiments.
Note that when the subset selection scoring scheme also 
makes use of graph structure information, then we may 
not want to use the same scoring scheme to choose A. 
This is because we may like to give importance to the 
label degree or transition probability to each node only 
based on the information from the external classifier. 

Choice of C We use 5-fold cross-validation (CV) 
technique to choose C for LGC. In the first setting, for 
each value of C chosen from a range in the log scale, 
we evaluate 5-fold CV accuracy on the selected subset 
of nodes using the label information available for these 
nodes. Then we pick the value of C that gives the 
maximum accuracy and find the final solution using the 
chosen C value. Here the label information is nothing 
but the label that has the maximum probability score. 
In the second setting we evaluate 5-fold CV accuracy on 
top-M percentage of nodes selected using MPS or EBS 
scheme. It is worth pointing out that, inaccuracies in 
the external class distribution could lead to a choice of 
C that is tilted away from the best possible choice. 

4 Probabilistic Methods 

The methods in this class, viz., the information reg- 
ularization (IR), dual IR methods and the probabilis- 
tic weighted vote relational neighbor (WvRN) classifier 
method, estimate the class distributions of the nodes. 
In this section we discuss their adaptation for the two 
solution settings and also introduce least squares regu- 
larization (LSR) based method. Overall, we group the 
IR, DIR and LSR methods under the category of region 
based regularization methods. 

4.1 WvRN Method and its Extension. The original
probabilistic weighted vote relational neighbor classifier
(with relaxation labeling) method [4] was formulated
to solve the collective classification problem where class
distributions of a subset of nodes are known and fixed.
Then the class distributions of the remaining (unlabeled)
nodes are obtained by an iterative algorithm. We
give a version of this algorithm in Algorithm 4.1. It
has two components, namely, the weighted vote relational
neighbor classifier (WvRN) component and the relaxation
labeling (RL) component. The relaxation labeling component
performs collective inferencing and keeps track
of the current probability estimates $p_i^{(t)}$ for all unlabeled
nodes at each time instant $t$. These frozen estimates
$p_j^{(t)}$ are used by the relational classifier. The relational
classifier computes the probability distribution
for each unlabeled node as the weighted sum of the probability
distributions $p_j^{(t)}$ of its neighbors with weights $w_{ij}$.
Since relaxation labeling may not converge, simulated
annealing is performed to ensure convergence (as given
in step (2) of the algorithm) [4]. Note that $\beta^{(t)} \to 0$ for
a sufficiently large number of iterations; therefore,
convergence is guaranteed. It has been observed that the
performance is robust when $0.9 < \nu < 1$. The WvRN
algorithm (Algorithm 4.1) can be directly used to solve
the problem in the first setting by setting the probability
distributions associated with the selected set of
nodes $i \in S$ to $\delta_{k,c_i}$ and initializing the probability
distributions of the remaining set of nodes ($\bar{S}$) with the
class prior obtained using the label information of the
selected set of nodes. Here the label for a node ($c_i$)
in the selected subset is defined as the label with the
maximum probability score obtained from $p_i$.

Extensions for Setting 2 Next we extend the
WvRN algorithm to solve the problem in setting 2. We
consider two variants. In the first variant we consider a
simpler form where we initialize $p_i^{(0)}$ with the external
classifier information for all nodes and run Algorithm
4.1 as it is. In the second variant we use the dongle
node idea (used in the GFHF method) and modify the
relational classifier estimate from $q_{i,k}$ to $\tilde{q}_{i,k}$ as follows.
With $\lambda_{ii}$ representing the transition probability, we
define $\tilde{q}_{i,k} = \lambda_{ii} p_{i,k}^{(e)} + (1 - \lambda_{ii}) q_{i,k}$ (where $q_{i,k}$ is as
defined in Algorithm 4.1) and $p_{i,k}^{(e)}$ is the distribution
information available from the external classifier. We
can select $\Lambda$ using the MPS or EBS schemes, and $\nu$ using
the cross-validation technique (see section 3.4).
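The relaxation labeling loop of Algorithm 4.1 and its two setting-2 variants can be sketched as below. This is a simplified illustration, not the authors' implementation: it runs a fixed number of iterations instead of testing convergence, it expects the caller to supply the initial distributions (class prior or external classifier output) as rows of `P_init`, and the function name, arguments, and toy data are hypothetical.

```python
import numpy as np

def wvrn_rl(W, P_init, lam=None, labeled=None, nu=0.95, iters=200):
    """Probabilistic WvRN with relaxation labeling (sketch).

    labeled: optional boolean mask of nodes whose distributions stay fixed.
    lam:     optional per-node label degrees; when given, the neighbor estimate
             is mixed with the initial (external) distribution, dongle-style.
    """
    n = W.shape[0]
    labeled = np.zeros(n, dtype=bool) if labeled is None else labeled
    P = P_init.copy()
    beta = 1.0                                  # beta^(0) = 1
    for _ in range(iters):
        Q = W @ P                               # weighted vote of the neighbors
        Q = Q / Q.sum(axis=1, keepdims=True)    # normalizing constant per node
        if lam is not None:
            Q = lam[:, None] * P_init + (1.0 - lam)[:, None] * Q
        P_new = beta * Q + (1.0 - beta) * P     # relaxation labeling update
        P_new[labeled] = P[labeled]             # fixed nodes are not updated
        P = P_new
        beta *= nu                              # annealing schedule
    return P

# Hypothetical 4-node chain and external class distributions.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
P_ext = np.array([[0.9, 0.1],
                  [0.6, 0.4],
                  [0.4, 0.6],
                  [0.2, 0.8]])

P_variant1 = wvrn_rl(W, P_ext)                           # initialize only
P_variant2 = wvrn_rl(W, P_ext, lam=P_ext.max(axis=1))    # MPS-style label degrees
```

Passing a `labeled` mask together with one-hot rows for the selected nodes and class-prior rows for the rest corresponds to the first setting.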

Algorithm 4.1 Probabilistic WvRN Algorithm

• Set $t = 0$, $\beta^{(0)} = 1$ and $\nu = 0.95$

• For all nodes $i \in \bar{S}$, initialize $p_i^{(0)}$ to the class prior
(obtained from known labeled nodes)

• Until convergence holds for all the nodes in $\bar{S}$ do:

For each element $i \in \bar{S}$ and $k = \{1, \ldots, K\}$

1. Estimate node class probability (using neighbor information):
$q_{i,k} = \frac{1}{\psi} \sum_j w_{ij} p_{j,k}^{(t)}$ (where $\psi$ is a
normalizing constant)

2. Set $p_{i,k}^{(t+1)} = \beta^{(t)} q_{i,k} + (1 - \beta^{(t)}) p_{i,k}^{(t)}$

Set $\beta^{(t+1)} = \nu \beta^{(t)}$ and $t = t + 1$.

4.2 Region Based Regularization Methods Corduneanu
and Jaakkola [2] proposed the information regularization
method. They introduced a notion of region,
where each region is defined as a subset of nodes in the
graph. The intuition is that nodes belonging to a given
region have the same label. Here a weight is associated
with each region, and a weight is associated with each node that
defines the relative importance of points that belong
to a given region. Further, a distribution is associated
with each region and node. Then the distributions of
labels are propagated on a graph for semi-supervised or
transductive learning. We refer to methods that work
within this framework of regions as region based regularization
methods. The IR, DIR and LSR methods fall
in this category. In the graph based classification problem,
each edge forms a region and the weight of the region is
nothing but the edge weight; further, an equal degree of
importance is given to both the nodes connected by the
edge. Then the regularized optimization problem can
be written as minimization of the objective function:



(4.6)  $H(P; W) = C \sum_{i \in S} \lambda_{ii} D(p_i^{(e)}, p_i) + \sum_{i,j} w_{ij} D(p_i, p_{ij})$

where $D(\cdot)$ denotes a divergence measure and $\lambda_{ii}$ denotes the
relative weight factor that we would like to assign to the
$i$-th node in $S$. Further, $p_{ij}$ denotes the probability
distribution associated with the region (edge) $(i, j)$. The
optimization problem should include the following constraints:
(1) $0 \le p_{i,k} \le 1$ and (2) $\sum_{k=1}^{K} p_{i,k} = 1$. The
region distributions are constrained in a similar way. In
this formulation, an alternating learning algorithm
consists of two steps. In the first step, all the node
distributions $p_i$, $\forall i$, collected as a vector $P$, are fixed and all
the region distributions $p_{ij}$, $\forall (i, j)$ pairs, are obtained
by minimizing only the second term in (4.6). In the
second step, using these estimated region distributions,
the node distributions are re-estimated by minimizing
both the terms in (4.6). These steps are repeated until
convergence.

Now let us look at how (4.6) is used in the two
solution settings. In the first setting, having selected
the subset of nodes $S$ we fix $p_{i,k} = p_{i,k}^{(e)} = \delta_{k,c_i}$, $\forall i \in S$
and $k = 1, \ldots, K$. Then the optimization is over $P_{V \setminus S}$,
where we have explicitly indicated the set of nodes to
be optimized. Therefore the first term in (4.6) does not
play a role in the optimization process. In the second
setting, the first term also plays a role since we optimize
over the entire set of node distributions $P_V$. Using this
broad setup let us consider several methods that stem
from different divergence measures $D(\cdot)$, namely,
KL-divergence and squared error (loss). The complexity
of all these methods is the same as that of the function
estimation methods.

4.2.1 Information Regularization (IR) Method 

Within the above framework, Corduneanu and
Jaakkola [5] used KL-divergence as the divergence measure
$D(\cdot)$ and called the second term the information
regularization term. A closed form solution can then be
obtained for $p_{ij}$ and is given by:



(4.7)  $p_{ij} = \frac{1}{2} (p_i + p_j)$.



Note that the $p_{ij}$ are the region (edge) distributions and
are obtained from minimizing the second term in (4.6)
with the KL-divergence measure. As mentioned earlier,
in solution setting 1 the first term does not play a role
in the optimization process. Then the distributions for
the unlabeled nodes are directly obtained as:



(4.8)  $p_{i,k} = \frac{1}{Z_i} \exp\Big( \sum_j \frac{w_{ij}}{\sum_{j'} w_{ij'}} \log p_{ij}(k) \Big)$



where $p_{ij}(k)$ denotes the k-th component of $p_{ij}$ (see (4.7))
and $Z_i$ denotes a normalizing constant. Thus in setting
1, closed form solutions exist in both the steps of the
alternating learning algorithm. In solution setting 2,
with the first term also playing a role in the optimization
process, there is no closed form solution like (4.8). Then
constrained optimization using either Newton's method
or the exponentiated gradient algorithm is carried out. This
step could be expensive and affects the scalability of
the method. To address this issue, Tsuda [6] proposed
a dual information regularization method, where closed
form solutions are obtained in both the steps even when
the first term in (4.6) plays a role.
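The two closed form steps of the IR method in setting 1 can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `ir_setting1`, the edge-dictionary input format and the uniform initialization of unlabeled nodes are assumptions, and the node update assumes that the edge weights in (4.8) are normalized by the total weight at each node.

```python
import math

def ir_setting1(weights, fixed, num_classes, num_iters=50):
    """Alternate the region update (4.7) and the node update (4.8).

    weights: {(i, j): w_ij} for undirected edges (hypothetical input format).
    fixed:   {node: class index} for the labeled subset S.
    """
    nbrs = {}
    for (i, j), w in weights.items():
        nbrs.setdefault(i, []).append((j, w))
        nbrs.setdefault(j, []).append((i, w))
    # Labeled nodes get a delta distribution; unlabeled nodes start uniform.
    p = {}
    for i in nbrs:
        if i in fixed:
            p[i] = [1.0 if k == fixed[i] else 0.0 for k in range(num_classes)]
        else:
            p[i] = [1.0 / num_classes] * num_classes
    for _ in range(num_iters):
        new_p = dict(p)
        for i in nbrs:
            if i in fixed:
                continue  # distributions of nodes in S stay fixed
            total_w = sum(w for _, w in nbrs[i])
            scores = []
            for k in range(num_classes):
                s = 0.0
                for j, w in nbrs[i]:
                    p_ij_k = 0.5 * (p[i][k] + p[j][k])      # region step (4.7)
                    s += (w / total_w) * math.log(p_ij_k)   # exponent of (4.8)
                scores.append(math.exp(s))
            z = sum(scores)                                  # normalizer Z_i
            new_p[i] = [s / z for s in scores]
        p = new_p
    return p
```

For example, on a chain 0 - 1 - 2 where nodes 0 and 2 carry different labels, the unlabeled middle node stays at the uniform distribution by symmetry.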



4.2.2 Dual Information Regularization (DIR) Method

Tsuda [6] modified the regularizer by interchanging
the arguments in the KL divergence measure
($KL(p_{ij} \| p_i)$) and using a modified $p_{ij}$ given by:



(4.9)  $p_{ij} = \frac{1}{Z_{ij}} \exp\Big( \frac{1}{2} \big( \log(p_i) + \log(p_j) \big) \Big)$



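where $Z_{ij}$ is the normalization factor. Then the closed
form solutions for $p_i$ are obtained as:

(4.10)  $p_i = \frac{1}{C \lambda_i + D_{ii}} \Big( C \lambda_i \, p_i^{(0)} + \sum_j w_{ij} \, p_{ij} \Big)$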



where $D_{ii} = \sum_j w_{ij}$. We refer to the update equations
(4.9) and (4.10) as the dual information regularization
(DIR) method. Let us consider the implications in
the two settings. In the first setting we fix the node
distributions of the nodes in S and estimate only the
node distributions of the nodes in $V \setminus S$. This estimation
is done using (4.9) and with $\lambda_i = 0, \forall i \notin S$ in (4.10).
In the second setting, (4.9) and (4.10) are used as they
are. Thus, closed form solutions are available in both
settings and in both steps of the learning algorithm, and
this helps in improving on the speed of the IR method.
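To make the difference between the two region updates concrete, the short sketch below contrasts the IR update (4.7), an arithmetic mean, with the DIR update (4.9), read here as a normalized geometric mean. The function names are illustrative, and the factor 1/2 inside the exponential of (4.9) is an assumption that follows from minimizing KL(p_ij || p_i) + KL(p_ij || p_j) over p_ij.

```python
import math

def ir_region_update(p_i, p_j):
    # (4.7): arithmetic mean of the two node distributions
    return [0.5 * (a + b) for a, b in zip(p_i, p_j)]

def dir_region_update(p_i, p_j):
    # (4.9): exp(0.5 * (log a + log b)) equals sqrt(a * b), then normalize.
    # Assumes the two distributions share some support, so that Z_ij > 0.
    unnorm = [math.sqrt(a * b) for a, b in zip(p_i, p_j)]
    z_ij = sum(unnorm)
    return [u / z_ij for u in unnorm]
```

When the two node distributions agree, both updates return that common distribution; they differ when the nodes disagree, since the geometric mean penalizes classes that either node considers unlikely.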

4.2.3 Least Squares Regularization (LSR) Method

We propose to use the squared error as the divergence
measure $D(\cdot)$; that is, for any two distributions
p and q, we define $D(p, q) = \|p - q\|^2$. Then it is easy
to verify from (4.6) that the optimal $p_{ij} = \frac{1}{2}(p_i + p_j)$,
which is the same as the one obtained in the IR method.
In this method, we proceed as in the 2-step algorithm,
where we estimate $P_V$ keeping all the region distributions
fixed. This results in the closed form solution:
$p_i = \frac{1}{C \lambda_i + D_{ii}} \big( C \lambda_i \, p_i^{(0)} + \sum_j w_{ij} \, p_{ij} \big)$. This solution is
exactly the same as the one given in (4.10). We refer to the
updates (4.7) and (4.10) as the least squares regularization
(LSR) method. Note that the application of this method in
the two settings is exactly the same as described above for the
DIR method.
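The LSR updates can be sketched in a few lines. This is an illustrative sketch rather than the authors' code: the function name `lsr`, the dictionary-based inputs and the `fixed` argument (used to hold the nodes in S constant in setting 1) are assumptions. Each node update is a convex combination of distributions, so the result remains a probability distribution without an explicit projection.

```python
def lsr(weights, p0, lam, C, fixed=(), num_iters=100):
    """Alternate the region update (4.7) and the node update (4.10).

    weights: {(i, j): w_ij} undirected edges; every endpoint must be a key of p0.
    p0:      {node: external classifier distribution p_i^(0)}.
    lam:     {node: lambda_i}, relative weight of the fitting term.
    fixed:   nodes whose distributions are held constant (setting 1).
    """
    nbrs = {i: [] for i in p0}
    for (i, j), w in weights.items():
        nbrs[i].append((j, w))
        nbrs[j].append((i, w))
    num_classes = len(next(iter(p0.values())))
    p = {i: list(dist) for i, dist in p0.items()}
    for _ in range(num_iters):
        new_p = {}
        for i in p:
            fit = C * lam[i]
            d_ii = sum(w for _, w in nbrs[i])
            if i in fixed or fit + d_ii == 0.0:
                new_p[i] = p[i]
                continue
            new_p[i] = [
                (fit * p0[i][k]
                 + sum(w * 0.5 * (p[i][k] + p[j][k]) for j, w in nbrs[i]))
                / (fit + d_ii)
                for k in range(num_classes)
            ]
        p = new_p
    return p
```

With two linked nodes whose external distributions are [0.9, 0.1] and [0.4, 0.6], C = 1 and unit weights, the second node is pulled toward the first and its most probable class changes from the second class to the first.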

Relation to other Methods The LSR objective
function, that is, (4.6) with squared error as the divergence
measure, has a structure similar to the generalized
quadratic function (3.2). The key difference is that, unlike
the rows of F, the node distributions $p_i$ and the region
distributions $p_{ij}$ are constrained to be probability
distributions. Now, comparing the DIR and LSR updates, we see
that they differ in the way region distributions are updated.
Thus, the LSR method interestingly combines the IR based
region distribution update with the DIR based node
distribution update. Also, like the IR methods, the objective
function is convex and has a global minimum.

5 Experiments 

We conducted two experiments on the two function estimation
methods, LGC and GFHF, and the four probabilistic methods,
WvRN, IR, DIR and LSR. In different settings the acronyms of
the methods refer to the corresponding modifications that we
described earlier in the paper. For WvRN in solution setting 2
there are two versions, which we will refer to as WvRN-V1 and
WvRN-V2 (see sections 4.1 and 4.2). In the first experiment we
evaluate the performance of the methods in solution settings 1
and 2 with four publicly available benchmark datasets. Next, we
present results from the second experiment, where we evaluated
the performance in solution setting 2 on datasets constructed
from web pages of three shopping sites.

Table 1: Public Datasets Description. n, K, |E| and d
denote the number of nodes, classes, edges and dimensions
respectively.

Dataset     n      K    |E|       d
WebKB       1051   2    269046    4840
USPS        3874   4    6398      256
CoraCite    4270   7    22516     -
CoraAll     4270   7    71824     -

5.1 Datasets Description. The details of the 
datasets are given in Table 1. The WebKB dataset con- 
tains two document categories, course and noncourse. 
Each document has two types of information, the web- 
page text content and link information. The number 
of features in page and link representations are 3000 
and 1840 respectively. Following [7] we constructed the 
graph based on cosine distance neighbors with Gaus- 
sian weights and chose 200 nearest neighbors. The 
USPS dataset is a handwritten digit recognition task, 
for which we used the same setting as given in [5]. The 
number of features is 256, obtained from a 16 x 16 im- 
age. The four classes correspond to digits 1 to 4. The 
graph is constructed using a radial basis function ker- 
nel with width set to 1.25; the number of neighbors is 
set to 1. The CORA dataset comprises computer sci- 
ence research papers. There are seven classes associ- 
ated with the papers; the classes correspond to the fol- 
lowing machine learning sub-topics: Case-based Meth- 
ods, Genetic Algorithms, Neural Networks, Probabilistic 
Methods, Reinforcement Learning, Rule Learning and 
Theory. The dataset consists of the full citation graph 
with labels for the topic of each paper. There are two 
variants of this dataset, referred to as CORACITE and 
CORAALL. These variants differ in the way the papers
are linked. The CoraCite variant uses only citation
links: an edge is placed between two papers if one
cites the other. The weight of an edge is normally one
unless the two papers cite each other, in which case it is
two. The CoraAll variant uses both citation and author
link information, where an edge is placed (in addition
to co-citation) when two papers have an author relation
as well. We used the CORA dataset versions available
from the Netkit package described in [4].
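The CoraCite weighting rule (weight one for a one-way citation, two for mutual citation) can be illustrated with a small helper. The function name and the (citing, cited) pair format are hypothetical and are not taken from the Netkit package.

```python
def citation_edge_weights(citations):
    """Build undirected edge weights from (citing, cited) pairs.

    An edge gets weight 1 if only one paper cites the other and
    weight 2 if the two papers cite each other.
    """
    # Drop self-citations and repeated entries of the same directed citation.
    directed = {(a, b) for a, b in citations if a != b}
    weights = {}
    for a, b in directed:
        key = (a, b) if a < b else (b, a)
        weights[key] = weights.get(key, 0) + 1
    return weights
```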

5.2 Classifier Information Generation. While
these datasets have only label information, we need
inaccurate external classifier (distribution) information in
our problem formulation. Therefore we need a model
which generates this distribution given the label
information. We now describe the model used in our
experiments. We fix two probability parameters $p_{min}$ and
$p_{max}$ with $p_{min} < p_{max}$ that take values in [0, 1]. Given
$p_{min}$ and $p_{max}$ we generate the distribution for each
node as follows. In the first step, we generate a random
number from the interval $[p_{min}, p_{max}]$ and treat it as
$p_{c_i}$, the probability score of the true label ($c_i$). Then
we generate $K - 1$ random numbers ($p_k^r$ with $k \ne c_i$)
from the interval [0, 1] and assign $p_k = p_k^r / \psi$, where
$\psi$ is a normalizing constant such that $\sum_{k \ne c_i} p_k = 1 - p_{c_i}$.
The choices of $p_{min}$ and $p_{max}$ determine the degree of
inaccuracy present in the information. If $p_{min}$ is set too
low, many nodes get labeled wrongly. Also note that
the number of classes plays a role in determining the
degree of accuracy since the mass $1 - p_{c_i}$ gets distributed
across $K - 1$ classes. We considered different
levels of inaccuracy by setting different values for $p_{min}$
and $p_{max}$. As the observations were almost the same
across the methods and settings for different levels of
inaccuracy, we present results only for one set of values.
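The generation model above can be sketched as follows. The function name, the use of Python's `random.Random` and the handling of the degenerate case where all raw scores are zero are assumptions made for illustration.

```python
import random

def noisy_distribution(true_class, num_classes, p_min, p_max, rng):
    """Generate an inaccurate class distribution for one node."""
    # Probability score of the true label, drawn from [p_min, p_max].
    p_true = rng.uniform(p_min, p_max)
    # K - 1 raw scores for the other classes, drawn from [0, 1].
    raw = [rng.random() for _ in range(num_classes - 1)]
    total = sum(raw)
    remaining = 1.0 - p_true
    if total > 0.0:
        # Dividing by psi = total / remaining makes the others sum to 1 - p_true.
        others = [r * remaining / total for r in raw]
    else:
        # Degenerate draw: spread the remaining mass uniformly.
        others = [remaining / (num_classes - 1)] * (num_classes - 1)
    dist = others[:true_class] + [p_true] + others[true_class:]
    return dist

rng = random.Random(0)
example = noisy_distribution(true_class=2, num_classes=4,
                             p_min=0.2, p_max=0.99, rng=rng)
```

With a small p_min, the true class can receive less mass than some other class, which is how wrongly labeled nodes arise in this model.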

5.3 Experiment With Setting 1. We conducted 
the experiment with the two function estimation meth- 
ods, LGC and GFHF, and the four probabilistic meth- 
ods, WvRN, IR, DIR and LSR. We measured the classi- 
fier accuracy on the entire set of nodes. This experiment 
was conducted with both the maximum probability scoring
(MPS) and entropy based scoring (EBS) subset selection
schemes. The selected subset size |S| was varied
from the top 10 percent to 90 percent (as per the chosen
selection scheme).
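As a rough illustration of the two selection schemes, the sketch below assumes that MPS ranks nodes by their largest class probability and that EBS ranks them by lowest entropy; these scoring rules and the function name are assumptions, since the schemes are defined earlier in the paper.

```python
import math

def select_subset(dists, fraction, scheme="EBS"):
    """Return the top `fraction` of nodes under the chosen scoring scheme.

    dists: {node: class distribution}. Higher score means more confident.
    """
    def mps_score(p):
        return max(p)

    def ebs_score(p):
        # Negative entropy, so that low-entropy nodes rank first.
        return sum(x * math.log(x) for x in p if x > 0.0)

    score = mps_score if scheme == "MPS" else ebs_score
    ranked = sorted(dists, key=lambda i: score(dists[i]), reverse=True)
    size = int(round(fraction * len(ranked)))
    return ranked[:size]

dists = {
    0: [0.9, 0.05, 0.05],
    1: [0.5, 0.5, 0.0],
    2: [0.4, 0.3, 0.3],
    3: [0.6, 0.2, 0.2],
}
```

On this toy input the two schemes disagree about the second selected node: MPS prefers node 3 (largest probability 0.6), while EBS prefers node 1, whose mass is concentrated on two classes.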

Let us discuss parameter selection. Recall that in 
the GFHF method C is set to 1. In the IR, DIR and 
LSR methods only the second term is present (once we
fix the probability distributions of the labeled nodes). In the
WvRN method we set v = 0.95. Thus only the LGC
method required tuning of the regularization parameter 
C. For selecting C we used 5-fold CV on the labeled 
nodes and varied C in the range [0.00153, 100] (doubling 
in each step). As mentioned in section 3.4.3, note that 
the 5-fold evaluation used here is inaccurate since it is 
based on noisy information. Therefore, the C estimate 
can be inferior sometimes. 

The average classifier accuracy was computed as
the average of accuracies obtained from 100 random
partitions of the graph. The results for one parameter
setting with the EBS subset selection scheme for all
the datasets are given in Figure 1; we set $p_{max} = 0.99$
for all the datasets and set $p_{min}$ to 0.4, 0.2 and 0.1 for the
WebKB, USPS and CORA datasets (both CORACITE
and CORAALL) respectively.

Figure 1: Setting 1 Results: X-axis represents the subset size in terms of percentage of number of nodes and
Y-axis represents accuracy. The results correspond to the EBS subset selection scheme. Results for the MPS
subset selection scheme are not given because of their closeness with EBS; see text. Initial refers to the accuracy
computed from the generated inaccurate distribution.

From Figure 1 we see that only on the WebKB
dataset are all the methods able to perform better
than the initial accuracy; in fact, even on this dataset, the
WvRN and GFHF methods fall short when the subset
size |S| is small. All methods other than LGC give
almost the same accuracy as |S| becomes larger, on
all datasets; of course, the exact value of |S| at which
they converge varies from one dataset to another. On
the CORA datasets WvRN performed best, followed
by GFHF, LSR, DIR, IR and LGC, when |S| is small.
On comparing the EBS and MPS selection schemes we
found that the EBS scheme performed slightly better
(around 2%) on the CORA datasets, while the performances
were almost the same on the WebKB dataset; the EBS scheme
performed slightly worse on the USPS dataset. Except
for these variations, the behavior of the EBS scheme
as a function of |S| was almost the same as that of the
MPS scheme; therefore, only the results for the EBS
scheme are shown in the figure. We analyzed the inferior
performance of LGC on the CORA datasets and observed
that an incorrect choice of C due to noisy CV estimates
was the reason behind it.



5.4 Experiment With Setting 2. We conducted
the experiment with LGC, GFHF, IR, DIR, LSR,
WvRN-V1 and WvRN-V2. The classifier accuracy
measurement, evaluation with two selection schemes,
subset size variation and $p_{min}$ and $p_{max}$ settings remain
the same as in setting 1. In this setting IR, DIR, LSR
and WvRN-V2 also require parameter tuning (that is, C
and v). In the case of WvRN-V1 there is no parameter
tuning and we set v = 0.95. In the case of GFHF we
set C = 1 as done earlier. All parameter tunings are
done using 5-fold CV, using top M percent of selected
nodes, as in setting 1. It is useful to recall that the
main difference of setting 2 from setting 1 is that the
distribution information from all nodes is used during
the solution process. We used the same range for C as
in setting 1 for the LGC method. In the case of the
WvRN-V2 method we used the same range but set v = . In
the modified IR method we selected C from the range
[0.0625, 312.5] (doubling in each step). In the case of the
LSR and DIR methods we selected C from the range
[0.078, 10] (doubling in each step). Finally the average
accuracy was computed from 100 random partitions of
the graph for all the methods except IR.

Efficiency The objective function optimization in
the case of IR takes significantly longer (an order of
magnitude) than the other methods in this setting
due to the nonlinear optimization involved. Therefore,
only for IR, we computed the average performance from
100, 20 and 10 partitions for the USPS, WebKB and CORA
datasets respectively. In terms of speed, the WvRN, LGC,
GFHF, LSR, DIR and IR methods come in that order.
Although the update equations for the DIR and LSR
methods look similar, the DIR method took more time
to converge than the LSR method, and we observed
that the LSR method was approximately 6-10 times
faster than the DIR method. On the other hand, the
LSR method was slower by 2-4 times compared to the
remaining methods.

Choice of C from Cross-Validation The choice
of C from the noisy CV accuracy estimates has an
important effect on the classification performances of the
LGC, IR, DIR and LSR methods; this effect varies
across datasets. In these methods we observed that inferior
performances were obtained when we chose the C that
gave the best CV accuracy. To get improved and robust
performances, we recommend an alternate way to
choose C: choose the smallest C such that the corresponding
CV accuracy estimate is within, say, 5% of
the best CV accuracy. This helps because the variation
of the CV accuracy estimate as a function of C can
be quite flat around the best CV accuracy and, in such
cases, choosing the least C within a specified accuracy
estimate regularizes the solution better. The results
reported in Figure 2 are the improved performances
obtained using the recommended way of choosing C
described above; some inferior performances can still be
seen with IR on the WebKB and USPS datasets (particularly
for low values of |S|) and LGC on the CORA datasets.
From this viewpoint, the results suggest that DIR, LSR
and WvRN-V2 are more robust compared to IR and
LGC.
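The recommended selection rule can be written as a short helper. This is a sketch under stated assumptions: the function names are hypothetical, the CV accuracies in the example are made-up illustrative numbers, and "within 5%" is interpreted as an absolute difference of 0.05 in accuracy.

```python
def doubling_grid(c_min, c_max):
    """Candidate C values from c_min up to c_max, doubling in each step."""
    grid = []
    c = c_min
    while c <= c_max * (1.0 + 1e-9):
        grid.append(c)
        c *= 2.0
    return grid

def pick_c(cv_accuracy, tolerance=0.05):
    """Smallest C whose CV accuracy is within `tolerance` of the best one.

    cv_accuracy: {C value: CV accuracy as a fraction in [0, 1]}.
    """
    best = max(cv_accuracy.values())
    return min(c for c, acc in cv_accuracy.items() if acc >= best - tolerance)

# Illustrative (made-up) CV accuracies on part of a doubling grid.
cv = {0.078125: 0.80, 0.15625: 0.86, 0.3125: 0.88, 0.625: 0.90, 1.25: 0.89}
chosen = pick_c(cv)
```

Here the best CV accuracy is obtained at C = 0.625, but the rule returns C = 0.15625, the smallest value whose estimate stays within the tolerance, which corresponds to stronger regularization.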

Classifier Performance From Figure 2 we clearly
see that the WvRN-V1, WvRN-V2, IR, DIR and LSR methods
improved the performance significantly over the initial
accuracy on all datasets. The LGC-extension and
modified GFHF methods improved the performance
significantly on the WebKB and USPS datasets. The performance
curves of GFHF and WvRN-V1 remain flat in
each plot. This is because there is no parameter tuning
involved in these methods. For the other methods the
choice of C remained almost the same for all |S|.
Considering the performance on all the datasets, the EBS
scheme seems to be more robust compared to the MPS
scheme, although slightly inferior performance is seen
on the CORAALL dataset. Note that the performance
variation over |S| is smaller with the EBS scheme, particularly
on the WebKB dataset. We believe this behavior
is due to the conservative estimate of $\lambda_i$ given by
this scheme compared to the MPS scheme in the noisy
scenario. Clearly, compared to the first setting, the second
setting, which uses the distribution information of all nodes
during the solution process, enhances the performance
significantly. Though the function estimation methods
performed quite close on two datasets, the probabilistic
methods performed better considering all the datasets.
Among the probabilistic methods, the performances of
IR, DIR and LSR were quite close as |S| became large.
Comparing WvRN-V2 and WvRN-V1, since the performance
difference is significant (>4% in many cases),
WvRN-V2 is to be preferred. Since the performance
variation across C in the WvRN-V2 method is small,
it seems that much of the gain comes from using dongle
nodes. Recall that the GFHF and WvRN methods have
almost the same performances on all datasets in the first
setting; however, in setting 2, distinctly different
performances of these methods are seen, particularly on the
CORA datasets. More investigation is needed to understand
this behavior. Finally, the performance improvement
is dependent on the quality of the relational graph
(with respect to the assumption of strong connectivity
of nodes belonging to the same class) and the initial accuracy.

5.5 Experiment with Shopping Datasets. We
evaluated the performances of DIR, LGC, WvRN-V2
and LSR in solution setting 2 on real world
datasets from three shopping sites (www.compusa.com,
www.uncommongoods.com and www.walmart.com). We
considered two binary classification problems: product
detail vs non-product and product listing vs non-listing.
While the product pages describe one specific product
(for example, a Canon camera with a certain model number)
in more detail, the listing pages present several products of
the same category (for example, different models of Canon
cameras) arranged as a list in each page. These problems
along with their site names are referred to as CU-D,
CU-L, UG-D, UG-L, WM-D and WM-L respectively in
Table 2. In each problem, the class distribution score
for each page was obtained using an external classifier
(EC) based on content features. The relational graph
was constructed using structural signatures (shingles)
obtained from the html tags of web pages. An edge between
two pages was formed when their structural signatures
had a match score of at least 6 (the values are in the
range [0, 8]) and a unit weight was assigned to each such
edge. Also, each node was connected to a maximum of
20 other nodes. A subset of pages (nodes) in each site
(graph) was manually labeled; the number of labeled
nodes is indicated as L in Table 2. The classifier
accuracy was evaluated for the external classifier and the
various methods on these labeled nodes. The results are
given in Table 2.

In almost all the cases the EBS scheme performed
better compared to the MPS scheme. The results given
in Table 2 are the best performances obtained over the
percentage of subset sizes (|S|) and C values. There
were minor variations (0.5-2%) in the performances
over |S| and choices of C. The results clearly
demonstrate that significant performance improvement (as
high as 10%) can be achieved. Also, the usefulness of
the proposed extensions and LSR method as effective
alternate methods can be seen from comparison with
the DIR method.

Table 2: Experiment: Shopping Datasets. n, L and |E| denote the number of nodes, labeled nodes and edges in the graph
respectively. The classification accuracy evaluated on the labeled nodes for the external classifier (EC), LGC, WvRN-V2,
LSR and DIR methods are given.

Problem   n       L      |E|       EC      LGC-Ext   WvRN-V2   LSR     DIR
CU-D      53023   1433   1186190   77.18   77.88     77.74     78.02   78.02
CU-L      53023   1433   1186190   91.70   93.79     93.93     94.21   94.21
UG-D      82027   1166   1714228   94.51   97.68     94.51     95.54   95.80
UG-L      82027   1166   1714228   81.90   91.51     85.93     91.25   91.51
WM-D      67997   1250   1318903   93.20   94.48     95.12     95.52   95.60
WM-L      67997   1250   1318903   93.36   94.72     94.72     95.60   95.04

6 Summary 

We considered the problem of collectively classifying 
entities where relational information and inaccurate 
class distribution information from another (external) 
classifier are available. We present below a list of key 
observations from the experimental studies conducted 
on several benchmark and real world datasets. 

• Of the two solution settings evaluated, the second 
setting (which uses external classifier information 
of all nodes) is better. Using this second setting 
a significant improvement over the external classi- 
fier performance can be achieved using relational 
information. 

• For parameter selection, the entropy based se- 
lection scheme was observed to be more robust 
compared to the maximum probability selection 
scheme. 

• With respect to choice of C using inaccurate CV 
estimates, the DIR, LSR and WvRN-V2 methods 
were observed to be more robust. 

• Overall, the probabilistic methods fared better
compared to the function estimation methods.
Within the probabilistic methods, the IR methods
were competitive with the other methods. Within the
proposed methods, the LSR, WvRN-V2 and LGC
ranked better, in that order.

• In terms of speed, the original IR method was quite
slow compared to other methods. Although the
DIR method was faster, it was still not competitive
in speed with the proposed methods. Among the
proposed methods, the WvRN-extensions, LGC
and LSR methods were faster, ranked by speed in
that order.

• Overall, considering the issues of speed, robustness
and accuracy together, the LSR and WvRN-V2
methods performed the best.

Figure 2: Setting 2 Results: X-axis represents the subset size in terms of percentage of number of nodes and
Y-axis represents accuracy. The left and right columns correspond to MPS and EBS subset selection schemes
respectively. The legends for the CORA datasets are the same as the one given for the WebKB and USPS datasets.
Initial refers to the accuracy computed from the generated inaccurate distribution.